Python Collections (Mappings & Streams)

Mappings

  • A mapping is a mutable unordered collection of key/value pairs.
  • Data structures that implement mappings include associative arrays, lookup tables, and hash tables.

Dictionaries

  • There is just one mapping type in Python: dict, for "dictionary".
  • dict can be called with a collection argument to create a dictionary with the elements of the argument.
  • The elements must be tuples or lists of two elements - a key and a value:
In [2]:
dict((('A','adenine'),('T', 'thymine'), ('C','cytosine'),('G','guanine')))
Out[2]:
{'A': 'adenine', 'T': 'thymine', 'C': 'cytosine', 'G': 'guanine'}
  • Dictionaries can also be written as a comma-separated list of key/value pairs enclosed in curly braces, with each key and value separated by a colon.
  • Empty braces create an empty dictionary.
  • The order within the braces does not affect lookups or equality; since Python 3.7, a dict preserves the order in which its keys were inserted.
In [3]:
{'A': 'adenine', 'C': 'cytosine', 'G': 'guanine', 'T': 'thymine'}
Out[3]:
{'A': 'adenine', 'C': 'cytosine', 'G': 'guanine', 'T': 'thymine'}
  • The keys of a mapping must be unique within the collection.
  • dict does not allow keys to be instances of mutable built-in types.
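
The construction rules above can be sketched as follows. The names from_pairs, from_literal, from_keywords, and codon_roles are illustrative and not part of the original notebook; the last part shows what happens when a mutable list is used as a key.

```python
# Three equivalent ways to build the same small dictionary.
from_pairs = dict([('A', 'adenine'), ('T', 'thymine')])
from_literal = {'A': 'adenine', 'T': 'thymine'}
from_keywords = dict(A='adenine', T='thymine')

# Immutable values such as tuples of strings are acceptable keys.
codon_roles = {('A', 'U', 'G'): 'start'}

# A list is mutable and unhashable, so using it as a key raises TypeError.
try:
    bad = {['A', 'U', 'G']: 'start'}
except TypeError as err:
    key_error_message = str(err)
```

More precisely, a key must be hashable; the mutable built-in types (list, set, dict) are not.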

Dictionary example: RNA codon translation table

In [62]:
RNA_codon_table = {
#                        Second Base
#        U             C             A             G
# U
    'UUU': 'Phe', 'UCU': 'Ser', 'UAU': 'Tyr', 'UGU': 'Cys',     # UxU
    'UUC': 'Phe', 'UCC': 'Ser', 'UAC': 'Tyr', 'UGC': 'Cys',     # UxC
    'UUA': 'Leu', 'UCA': 'Ser', 'UAA': '---', 'UGA': '---',     # UxA
    'UUG': 'Leu', 'UCG': 'Ser', 'UAG': '---', 'UGG': 'Trp',     # UxG
# C
    'CUU': 'Leu', 'CCU': 'Pro', 'CAU': 'His', 'CGU': 'Arg',     # CxU
    'CUC': 'Leu', 'CCC': 'Pro', 'CAC': 'His', 'CGC': 'Arg',     # CxC
    'CUA': 'Leu', 'CCA': 'Pro', 'CAA': 'Gln', 'CGA': 'Arg',     # CxA
    'CUG': 'Leu', 'CCG': 'Pro', 'CAG': 'Gln', 'CGG': 'Arg',     # CxG
# A
    'AUU': 'Ile', 'ACU': 'Thr', 'AAU': 'Asn', 'AGU': 'Ser',     # AxU
    'AUC': 'Ile', 'ACC': 'Thr', 'AAC': 'Asn', 'AGC': 'Ser',     # AxC
    'AUA': 'Ile', 'ACA': 'Thr', 'AAA': 'Lys', 'AGA': 'Arg',     # AxA
    'AUG': 'Met', 'ACG': 'Thr', 'AAG': 'Lys', 'AGG': 'Arg',     # AxG
# G
    'GUU': 'Val', 'GCU': 'Ala', 'GAU': 'Asp', 'GGU': 'Gly',     # GxU
    'GUC': 'Val', 'GCC': 'Ala', 'GAC': 'Asp', 'GGC': 'Gly',     # GxC
    'GUA': 'Val', 'GCA': 'Ala', 'GAA': 'Glu', 'GGA': 'Gly',     # GxA
    'GUG': 'Val', 'GCG': 'Ala', 'GAG': 'Glu', 'GGG': 'Gly'      # GxG
}
In [63]:
def translate_RNA_codon(codon):
    """RNA codon lookup from a dictionary"""
    return RNA_codon_table[codon]
In [64]:
translate_RNA_codon('GUG')
Out[64]:
'Val'
In [7]:
RNA_codon_table
Out[7]:
{'UUU': 'Phe',
 'UCU': 'Ser',
 'UAU': 'Tyr',
 'UGU': 'Cys',
 'UUC': 'Phe',
 'UCC': 'Ser',
 'UAC': 'Tyr',
 'UGC': 'Cys',
 'UUA': 'Leu',
 'UCA': 'Ser',
 'UAA': '---',
 'UGA': '---',
 'UUG': 'Leu',
 'UCG': 'Ser',
 'UAG': '---',
 'UGG': 'Trp',
 'CUU': 'Leu',
 'CCU': 'Pro',
 'CAU': 'His',
 'CGU': 'Arg',
 'CUC': 'Leu',
 'CCC': 'Pro',
 'CAC': 'His',
 'CGC': 'Arg',
 'CUA': 'Leu',
 'CCA': 'Pro',
 'CAA': 'Gln',
 'CGA': 'Arg',
 'CUG': 'Leu',
 'CCG': 'Pro',
 'CAG': 'Gln',
 'CGG': 'Arg',
 'AUU': 'Ile',
 'ACU': 'Thr',
 'AAU': 'Asn',
 'AGU': 'Ser',
 'AUC': 'Ile',
 'ACC': 'Thr',
 'AAC': 'Asn',
 'AGC': 'Ser',
 'AUA': 'Ile',
 'ACA': 'Thr',
 'AAA': 'Lys',
 'AGA': 'Arg',
 'AUG': 'Met',
 'ACG': 'Thr',
 'AAG': 'Lys',
 'AGG': 'Arg',
 'GUU': 'Val',
 'GCU': 'Ala',
 'GAU': 'Asp',
 'GGU': 'Gly',
 'GUC': 'Val',
 'GCC': 'Ala',
 'GAC': 'Asp',
 'GGC': 'Gly',
 'GUA': 'Val',
 'GCA': 'Ala',
 'GAA': 'Glu',
 'GGA': 'Gly',
 'GUG': 'Val',
 'GCG': 'Ala',
 'GAG': 'Glu',
 'GGG': 'Gly'}
  • To obtain a function that will help you see the structure of your data, you should include the following line in your Python files:
In [8]:
from pprint import pprint as pp
In [9]:
pp(RNA_codon_table)
{'AAA': 'Lys',
 'AAC': 'Asn',
 'AAG': 'Lys',
 'AAU': 'Asn',
 'ACA': 'Thr',
 'ACC': 'Thr',
 'ACG': 'Thr',
 'ACU': 'Thr',
 'AGA': 'Arg',
 'AGC': 'Ser',
 'AGG': 'Arg',
 'AGU': 'Ser',
 'AUA': 'Ile',
 'AUC': 'Ile',
 'AUG': 'Met',
 'AUU': 'Ile',
 'CAA': 'Gln',
 'CAC': 'His',
 'CAG': 'Gln',
 'CAU': 'His',
 'CCA': 'Pro',
 'CCC': 'Pro',
 'CCG': 'Pro',
 'CCU': 'Pro',
 'CGA': 'Arg',
 'CGC': 'Arg',
 'CGG': 'Arg',
 'CGU': 'Arg',
 'CUA': 'Leu',
 'CUC': 'Leu',
 'CUG': 'Leu',
 'CUU': 'Leu',
 'GAA': 'Glu',
 'GAC': 'Asp',
 'GAG': 'Glu',
 'GAU': 'Asp',
 'GCA': 'Ala',
 'GCC': 'Ala',
 'GCG': 'Ala',
 'GCU': 'Ala',
 'GGA': 'Gly',
 'GGC': 'Gly',
 'GGG': 'Gly',
 'GGU': 'Gly',
 'GUA': 'Val',
 'GUC': 'Val',
 'GUG': 'Val',
 'GUU': 'Val',
 'UAA': '---',
 'UAC': 'Tyr',
 'UAG': '---',
 'UAU': 'Tyr',
 'UCA': 'Ser',
 'UCC': 'Ser',
 'UCG': 'Ser',
 'UCU': 'Ser',
 'UGA': '---',
 'UGC': 'Cys',
 'UGG': 'Trp',
 'UGU': 'Cys',
 'UUA': 'Leu',
 'UUC': 'Phe',
 'UUG': 'Leu',
 'UUU': 'Phe'}


  • The dict methods keys(), values(), and items() return "sequence-like objects": they aren't sequences, but they can be used as if they were in many contexts, such as iteration and membership tests.
In [10]:
list(RNA_codon_table.keys())
Out[10]:
['UUU',
 'UCU',
 'UAU',
 'UGU',
 'UUC',
 'UCC',
 'UAC',
 'UGC',
 'UUA',
 'UCA',
 'UAA',
 'UGA',
 'UUG',
 'UCG',
 'UAG',
 'UGG',
 'CUU',
 'CCU',
 'CAU',
 'CGU',
 'CUC',
 'CCC',
 'CAC',
 'CGC',
 'CUA',
 'CCA',
 'CAA',
 'CGA',
 'CUG',
 'CCG',
 'CAG',
 'CGG',
 'AUU',
 'ACU',
 'AAU',
 'AGU',
 'AUC',
 'ACC',
 'AAC',
 'AGC',
 'AUA',
 'ACA',
 'AAA',
 'AGA',
 'AUG',
 'ACG',
 'AAG',
 'AGG',
 'GUU',
 'GCU',
 'GAU',
 'GGU',
 'GUC',
 'GCC',
 'GAC',
 'GGC',
 'GUA',
 'GCA',
 'GAA',
 'GGA',
 'GUG',
 'GCG',
 'GAG',
 'GGG']
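
A small sketch of how these sequence-like objects behave, using a made-up three-entry dictionary named bases (an assumption for illustration, not part of the original notebook):

```python
bases = {'A': 'adenine', 'C': 'cytosine', 'G': 'guanine'}

keys = bases.keys()          # a view object, not a list
# keys[0] would raise TypeError: views do not support indexing

has_g = 'G' in keys          # membership tests work
n_keys = len(keys)           # so does len()
pairs = [(k, v) for k, v in bases.items()]   # and so does iteration

bases['T'] = 'thymine'       # a view reflects later changes to the dict
keys_after = list(keys)      # convert to a list when indexing is needed
```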

Streams

  • A stream is a temporally ordered sequence of indefinite length, usually limited to one type of element.
  • Each stream has two ends: a source that provides the elements and a sink that absorbs the elements.
  • The more common kinds of stream sources are files, network connections, and the output of a kind of function called a generator.
  • Files and network sources are also common kinds of sinks.

Files

  • A Python file is an object that is an interface to an external file, not the file itself.
  • File objects provide methods for reading, writing, and managing their instances.
  • Depending on a parameter supplied when an instance is created, the elements of the file object are either bytes or Unicode characters.
  • Some methods treat files as streams of bytes or characters, and other methods treat them as streams of lines of bytes or characters.
  • Most of the time a file object is a one-way sequence: it can either be read from or written to.
  • It is possible to create a file object that is a two-way stream, though it would be more accurate to say it is a pair of streams—one for reading and one for writing—that just happen to connect to the same external file.
  • Normally, when a file object is created for writing, any existing file with the same path is emptied.
  • File objects can be created to append instead, though, so that data is written to the end of an existing file.

Working with file objects

  • The built-in function open(path, mode) creates a file object representing the external file at the operating system location specified by the string path.
  • The default use is reading, and the default interpretation is text.

  • Call the method close() to close a file object when it’s no longer needed.
  • The with statement is used to open and name a file, then automatically close the file regardless of whether an error occurs during the execution of its statements.

    with open(path, mode) as name: statements using name

  • More than one file can be opened with the same with statement, as when reading from one and writing to the other.

    with open(path1, mode1) as name1, open(path2, mode2) as name2, ... : statements using names
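
As a minimal sketch of the two-file form, the following reads from one file and writes an upper-cased copy to another. The file names and the temporary directory are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Hypothetical file names inside a throwaway temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'input.txt')
dst_path = os.path.join(workdir, 'output.txt')

with open(src_path, 'w') as src:
    src.write('acgt\nttga\n')

# Read from one file and write to the other in a single with statement.
with open(src_path) as infile, open(dst_path, 'w') as outfile:
    for line in infile:
        outfile.write(line.upper())

# Both files were closed automatically when the with block ended.
with open(dst_path) as result:
    copied = result.read()
```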

File reading

  • fileobj.read([count]) - Reads count characters (bytes, for a file opened in binary mode), or until the end of the file, whichever comes first; if count is omitted, reads everything until the end of the file. If at the end of the file, returns an empty string. This method treats the file as an input stream of characters.
  • fileobj.readline([count]) - Reads one line from the file object and returns the entire line, including the end-of-line character; if count is present, reads at most count characters. If at the end of the file, returns an empty string. This method treats the file as an input stream of lines.
  • fileobj.readlines() - Reads lines of a file object until the end of the file is reached and returns them as a list of strings; this method treats the file as an input stream of lines.

File Writing

  • fileobj.write(string) - Writes string to fileobj, treating it as an output stream of characters.
  • fileobj.writelines(sequence) - Writes each element of sequence, which must all be strings, to fileobj, treating it as an output stream of lines.
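
The reading and writing methods above can be tried together on a small text file. This is a sketch only: the path and the file contents are made up, and the file lives in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')   # hypothetical file

with open(path, 'w') as f:
    f.write('>seq1\n')                      # writes one string
    f.writelines(['ACGU\n', 'GGCC\n'])      # newlines are not added for you

with open(path) as f:
    first_five = f.read(5)        # the first five characters
    rest_of_line = f.readline()   # the remainder of the first line
    remaining = f.readlines()     # every remaining line, as a list
    at_end = f.read()             # empty string once the end is reached
```

Note that writelines does not append end-of-line characters, so each string must carry its own.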
In [1]:
def read_FASTA_strings(filename):
    """Read FASTA sequence from a file"""
    with open(filename) as file:
        return file.read().split('>')[1:]
In [2]:
seqs = read_FASTA_strings("data/aa003.fasta")
In [3]:
seqs
Out[3]:
['gi|6693803|gb|AAF24990.1|AF121349_4 (AF121349) late expression factor 5 [Neodiprion sertifer nucleopolyhedrovirus]\nMPPCSEKTLKDIEEIFLKFRRKKKWEDLIRYLKYKQPKCVKTFNLTGTGHKYHAMWAYNPITDKREKKQISLDVMKIQEL\nHRITNNNSKLYVEIRKIMTDDHRCPCEEIKNYMQQIAEYKNNRSNKVFNTPPTKIVPNALEKILKNFTINLMIDKKPKKK\nITKSAHTIKHPPVLNIDYEHTLEFAGQTTVKEICKHASLGDTIEIQNRSFDEMVNLYTTCVQCKQMYKIQ\n',
 'gi|6693805|gb|AAF24991.1| (AF125506) astacin family metalloendopeptidase FARM-1 [Hydra vulgaris]\nMSSSNHIHVLRAIDEYHKHTCLKFVKRTNQDAYLSFYPGGGCSSLVGYVRGRINDVSLAGGCLRLGTVMHEIGHSIGLYH\nEQSRPDRDDHVTIIWNNIQSNMRFNFDKFDRNKINSLGFPYDYESMMHYESNAFGGGQVTIRTKDPSKQKLIGNRQGFSE\nIDKQQINAMYNCNRGGSTLPPSVPPTVSPVAQCVEGQDLDNRCLGWATSGYCTATDPAHLETMKKKCCKSCKESAICNDK\nNTRCDEWAKKGECKANPNWMLGNCSKSCLVC\n',
 'gi|6693816|gb|AAF24994.1|AF129447_1 (AF129447) RpoB [Klebsiella ornithinolytica]\nAAVKEFFGSSQLSQFMDQNNPLSEITHKRRISALGPGGLTRERAGFEVRDVHPTHYGRVCPIETPEGPNIGLINSLSVYA\nQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDDEGHFVEDLVTCRSKGESSLFSRDQVDYMDVSTQQ\nVVSVGGSSERVL\n']
  • Problems with output:
    • the description line preceding each sequence is part of the sequence string
    • the string contains internal newline characters.

Generators

  • A generator is an object that returns values from a series it computes, one value per request.
  • Advantages of generators:
    • A generator can produce an infinitely large series of values, much as repeated calls to random.randint can continue indefinitely (randint itself is an ordinary function, not a generator).
    • A generator can encapsulate significant computation with the caller requesting values until it finds one that meets some condition
    • A generator can take the place of a list when the list is so long and/or its values are so large that creating the entire list before processing its elements would use enormous amounts of memory.
  • A value is obtained from a generator by calling the built-in function next with the generator object as its argument.
  • The function that produced the generator object resumes its execution until a yield statement is encountered. The value of the yield is returned as the value of next.
  • The values of parameters and names assigned in the function are retained between calls.

    next(generator[, default]) - Gets the next value from the generator object; if the generator has no more values to produce, returns default, raising an error if no default value was specified.
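
A short sketch of next with and without a default value. The generator function countdown is a hypothetical helper written only for this illustration.

```python
def countdown(n):
    """Yield n, n-1, ..., 1 (a hypothetical helper for illustration)."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(2)
first = next(gen)            # execution runs to the first yield
second = next(gen)           # resumes after the yield; n was retained
fallback = next(gen, None)   # generator exhausted, so the default is returned

try:
    next(gen)                # no default given: raises StopIteration
    raised = False
except StopIteration:
    raised = True
```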

In [1]:
def genTest():
    yield 1
    yield 2
In [2]:
genTest()
Out[2]:
<generator object genTest at 0x7f8f40238620>
In [7]:
foo = genTest()
In [8]:
next(foo)
Out[8]:
1
In [9]:
for n in genTest():
    print(n)
1
2
In [11]:
def genFib():
    fibn_1 = 1  # fib(n - 1)
    fibn_2 = 0  # fib(n - 2)
    while True:
        fibn = fibn_1 + fibn_2  # fib(n) = fib(n - 1) + fib(n - 2)
        yield fibn              # named fibn to avoid shadowing the built-in next
        fibn_2 = fibn_1
        fibn_1 = fibn
In [13]:
fib = genFib()
In [16]:
for i in range(10):
    print(next(fib))
1
2
3
5
8
13
21
34
55
89

Comprehensions

  • A comprehension creates a set, list, or dictionary from the results of evaluating an expression for each element of another collection.
  • Each kind of comprehension is written surrounded by the characters used to surround the corresponding type of collection value: brackets for lists, and braces for sets and dictionaries.

List comprehensions

The simplest form of list comprehension is: [expression for item in collection]

In [17]:
def validate_base_sequence(base_sequence, RNAflag = False):
    valid_bases = 'UCAG' if RNAflag else 'TCAG'
    return all([(base in valid_bases)
                for base in base_sequence.upper()])
In [18]:
from random import randint

def random_base(RNAflag = False):
    return ('UCAG' if RNAflag else 'TCAG')[randint(0,3)]

def random_codon(RNAflag = False):
    return random_base(RNAflag) + random_base(RNAflag) + random_base(RNAflag)

def random_codons(minlength = 3, maxlength = 10, RNAflag = False):
    """Generate a random list of codons (RNA if RNAflag, else DNA)
    between minlength and maxlength, inclusive"""
    return [random_codon(RNAflag)
            for n in range(randint(minlength, maxlength))]
In [19]:
minlength = 2
maxlength = 5
RNAflag = True
In [51]:
randnum = randint(minlength, maxlength)
randnum
Out[51]:
2
In [52]:
[n for n in range(randnum)]
Out[52]:
[0, 1]
In [61]:
[random_codon(RNAflag) for n in range(randnum)]
Out[61]:
['ACC', 'UAA']
In [65]:
def random_codons_translation(minlength = 3, maxlength = 10):
    """Generate a list of amino acid translations of between minlength
    and maxlength random RNA codons, inclusive"""
    return [translate_RNA_codon(codon) for codon in
            random_codons(minlength, maxlength, True)]
    
In [80]:
random_codons_translation()
Out[80]:
['Asn', 'Ser', 'Lys', 'Ile', 'Pro', 'Lys', 'Leu', 'Ser']
In [12]:
def test():
    print()
    print(random_base())
    print(random_base())
    print(random_base(False))
    print(random_base(False))
    print()
    print(random_base(True))
    print(random_base(True))
    print(random_base(True))
    print(random_base(True))
    print()
    print(random_codon())
    print(random_codon(False))
    print(random_codon(True))
    print()
    print(random_codons())
    print(random_codons())
    print(random_codons())
    print(random_codons())
    print()
    print(random_codons(6))
    print(random_codons(6, 15))
    print()
    print(random_codons(RNAflag = True))
    print(random_codons(RNAflag = True))
    print()
    print(random_codons_translation())
    print(random_codons_translation(5))
    print()
    print(random_codons_translation(8, 12))
    print(random_codons_translation(8, 12))
test()
G
T
T
C

C
G
U
G

GTA
AGA
UCU

['CTC', 'GAT', 'ATC']
['ACA', 'ATC', 'ATG', 'CGT', 'GTA', 'CAG', 'GGT', 'CGA']
['CAA', 'TCT', 'CCC', 'AAA']
['GGT', 'GGT', 'CAG', 'CGC', 'CGA', 'GGA', 'ACA']

['TAA', 'CCG', 'CCA', 'TTT', 'CGG', 'TTC']
['TTA', 'GCC', 'GTT', 'AGA', 'CAT', 'TAC', 'CAA', 'GTG', 'TCT', 'AAG']

['CCU', 'AUG', 'GAG', 'UGA', 'ACA', 'CUG', 'AAU']
['UUG', 'AGA', 'CCA', 'GCG', 'GAC', 'GAG', 'UAA', 'GCU', 'AGC']

['Leu', 'Ala', 'Leu', 'Arg']
['Asp', 'Arg', 'Ile', 'Arg', 'Pro', 'Ser', 'Trp', 'Gly']

['Ile', 'Arg', 'Pro', 'Met', 'Asp', 'Ile', 'Ser', 'Ser', 'Lys', 'Arg', 'Asp']
['Leu', 'Ile', 'Glu', 'Leu', 'Tyr', 'Ala', 'Ile', 'Val', 'Arg']

Revisit FASTA reader

  • Suppose we want to split the description from a base sequence
In [6]:
def read_FASTA_entries(filename):
    return [seq.partition('\n') for seq in read_FASTA_strings(filename)]
  • Given a string string and another string sepr, the call string.partition(sepr) returns a tuple with three elements:
    • the part of string up to the first appearance of sepr
    • sepr
    • the part of string after sepr.
  • Calling partition with an argument of '\n' will split the description from the base sequence.
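
A quick sketch of str.partition on a made-up FASTA-style entry (the entry text is invented for illustration):

```python
entry = 'gi|12345|demo description\nMPPCSE\nKTLKDI\n'   # made-up entry

# Only the first newline is consumed; later ones stay in the third part.
description, separator, sequence = entry.partition('\n')

# If the separator is absent, the whole string comes back first,
# followed by two empty strings.
no_sep = 'MPPCSE'.partition('\n')
```

Because partition always returns exactly three elements, the result can be unpacked safely even when the separator is missing.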
In [7]:
seqs = read_FASTA_entries("data/aa003.fasta")
In [8]:
seqs
Out[8]:
[('gi|6693803|gb|AAF24990.1|AF121349_4 (AF121349) late expression factor 5 [Neodiprion sertifer nucleopolyhedrovirus]',
  '\n',
  'MPPCSEKTLKDIEEIFLKFRRKKKWEDLIRYLKYKQPKCVKTFNLTGTGHKYHAMWAYNPITDKREKKQISLDVMKIQEL\nHRITNNNSKLYVEIRKIMTDDHRCPCEEIKNYMQQIAEYKNNRSNKVFNTPPTKIVPNALEKILKNFTINLMIDKKPKKK\nITKSAHTIKHPPVLNIDYEHTLEFAGQTTVKEICKHASLGDTIEIQNRSFDEMVNLYTTCVQCKQMYKIQ\n'),
 ('gi|6693805|gb|AAF24991.1| (AF125506) astacin family metalloendopeptidase FARM-1 [Hydra vulgaris]',
  '\n',
  'MSSSNHIHVLRAIDEYHKHTCLKFVKRTNQDAYLSFYPGGGCSSLVGYVRGRINDVSLAGGCLRLGTVMHEIGHSIGLYH\nEQSRPDRDDHVTIIWNNIQSNMRFNFDKFDRNKINSLGFPYDYESMMHYESNAFGGGQVTIRTKDPSKQKLIGNRQGFSE\nIDKQQINAMYNCNRGGSTLPPSVPPTVSPVAQCVEGQDLDNRCLGWATSGYCTATDPAHLETMKKKCCKSCKESAICNDK\nNTRCDEWAKKGECKANPNWMLGNCSKSCLVC\n'),
 ('gi|6693816|gb|AAF24994.1|AF129447_1 (AF129447) RpoB [Klebsiella ornithinolytica]',
  '\n',
  'AAVKEFFGSSQLSQFMDQNNPLSEITHKRRISALGPGGLTRERAGFEVRDVHPTHYGRVCPIETPEGPNIGLINSLSVYA\nQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDDEGHFVEDLVTCRSKGESSLFSRDQVDYMDVSTQQ\nVVSVGGSSERVL\n')]
  • Next, we need to remove the newline characters from within each sequence string.
  • Again, we’ll define a new function: it will use str.replace and another list comprehension.
  • We’ll also discard the useless '\n' separator that partition places in the middle of each triple.
In [9]:
def read_FASTA_sequences(filename):
    return [[seq[0], seq[2].replace('\n', '')]           # delete newlines
             for seq in read_FASTA_entries(filename)]
In [11]:
seqs = read_FASTA_sequences("data/aa003.fasta")
seqs
Out[11]:
[['gi|6693803|gb|AAF24990.1|AF121349_4 (AF121349) late expression factor 5 [Neodiprion sertifer nucleopolyhedrovirus]',
  'MPPCSEKTLKDIEEIFLKFRRKKKWEDLIRYLKYKQPKCVKTFNLTGTGHKYHAMWAYNPITDKREKKQISLDVMKIQELHRITNNNSKLYVEIRKIMTDDHRCPCEEIKNYMQQIAEYKNNRSNKVFNTPPTKIVPNALEKILKNFTINLMIDKKPKKKITKSAHTIKHPPVLNIDYEHTLEFAGQTTVKEICKHASLGDTIEIQNRSFDEMVNLYTTCVQCKQMYKIQ'],
 ['gi|6693805|gb|AAF24991.1| (AF125506) astacin family metalloendopeptidase FARM-1 [Hydra vulgaris]',
  'MSSSNHIHVLRAIDEYHKHTCLKFVKRTNQDAYLSFYPGGGCSSLVGYVRGRINDVSLAGGCLRLGTVMHEIGHSIGLYHEQSRPDRDDHVTIIWNNIQSNMRFNFDKFDRNKINSLGFPYDYESMMHYESNAFGGGQVTIRTKDPSKQKLIGNRQGFSEIDKQQINAMYNCNRGGSTLPPSVPPTVSPVAQCVEGQDLDNRCLGWATSGYCTATDPAHLETMKKKCCKSCKESAICNDKNTRCDEWAKKGECKANPNWMLGNCSKSCLVC'],
 ['gi|6693816|gb|AAF24994.1|AF129447_1 (AF129447) RpoB [Klebsiella ornithinolytica]',
  'AAVKEFFGSSQLSQFMDQNNPLSEITHKRRISALGPGGLTRERAGFEVRDVHPTHYGRVCPIETPEGPNIGLINSLSVYAQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDDEGHFVEDLVTCRSKGESSLFSRDQVDYMDVSTQQVVSVGGSSERVL']]
In [12]:
def read_FASTA_sequences_unpacked(filename):
    return [(info, seq.replace('\n', ''))
            for info, ignore, seq in                     # ignore is ignored (!)
            read_FASTA_entries(filename)]
  • The description lines of FASTA files generally contain vertical bars to separate field values.
  • We define a new function that calls read_FASTA_sequences, then uses str.split to return a list of field values for the description instead of just a string.
In [16]:
def read_FASTA_sequences_and_info(filename):
    return [[seq[0].split('|'), seq[1]] for seq in
            read_FASTA_sequences(filename)]
In [17]:
seqs = read_FASTA_sequences_and_info("data/aa003.fasta")
print(seqs)
[[['gi', '6693803', 'gb', 'AAF24990.1', 'AF121349_4 (AF121349) late expression factor 5 [Neodiprion sertifer nucleopolyhedrovirus]'], 'MPPCSEKTLKDIEEIFLKFRRKKKWEDLIRYLKYKQPKCVKTFNLTGTGHKYHAMWAYNPITDKREKKQISLDVMKIQELHRITNNNSKLYVEIRKIMTDDHRCPCEEIKNYMQQIAEYKNNRSNKVFNTPPTKIVPNALEKILKNFTINLMIDKKPKKKITKSAHTIKHPPVLNIDYEHTLEFAGQTTVKEICKHASLGDTIEIQNRSFDEMVNLYTTCVQCKQMYKIQ'], [['gi', '6693805', 'gb', 'AAF24991.1', ' (AF125506) astacin family metalloendopeptidase FARM-1 [Hydra vulgaris]'], 'MSSSNHIHVLRAIDEYHKHTCLKFVKRTNQDAYLSFYPGGGCSSLVGYVRGRINDVSLAGGCLRLGTVMHEIGHSIGLYHEQSRPDRDDHVTIIWNNIQSNMRFNFDKFDRNKINSLGFPYDYESMMHYESNAFGGGQVTIRTKDPSKQKLIGNRQGFSEIDKQQINAMYNCNRGGSTLPPSVPPTVSPVAQCVEGQDLDNRCLGWATSGYCTATDPAHLETMKKKCCKSCKESAICNDKNTRCDEWAKKGECKANPNWMLGNCSKSCLVC'], [['gi', '6693816', 'gb', 'AAF24994.1', 'AF129447_1 (AF129447) RpoB [Klebsiella ornithinolytica]'], 'AAVKEFFGSSQLSQFMDQNNPLSEITHKRRISALGPGGLTRERAGFEVRDVHPTHYGRVCPIETPEGPNIGLINSLSVYAQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDDEGHFVEDLVTCRSKGESSLFSRDQVDYMDVSTQQVVSVGGSSERVL']]
  • Each sequence in the result returned is represented by a two-element list.
    • The first element is a list of the segments of the description
    • The second is the whole sequence with no newline characters.

  • Altogether, the functions we developed to read sequences from a FASTA file do the following:
    • Split the file contents at '>' to get a list of strings representing entries
    • Partition the strings to separate the first line from the rest
    • Remove the useless '\n' separator from the resulting triples
    • Remove the newlines from the sequence data
    • Split the description line into pieces where vertical bars appear
In [3]:
#Reading FASTA sequences with one compact function

def read_FASTA(filename):
    with open(filename) as file:
        return [(part[0].split('|'),
                 part[2].replace('\n', ''))
                for part in
                [entry.partition('\n')
                 for entry in file.read().split('>')[1:]]]
In [4]:
filename = 'data/aa003.fasta'
seqs = read_FASTA(filename)
seqs
Out[4]:
[(['gi',
   '6693803',
   'gb',
   'AAF24990.1',
   'AF121349_4 (AF121349) late expression factor 5 [Neodiprion sertifer nucleopolyhedrovirus]'],
  'MPPCSEKTLKDIEEIFLKFRRKKKWEDLIRYLKYKQPKCVKTFNLTGTGHKYHAMWAYNPITDKREKKQISLDVMKIQELHRITNNNSKLYVEIRKIMTDDHRCPCEEIKNYMQQIAEYKNNRSNKVFNTPPTKIVPNALEKILKNFTINLMIDKKPKKKITKSAHTIKHPPVLNIDYEHTLEFAGQTTVKEICKHASLGDTIEIQNRSFDEMVNLYTTCVQCKQMYKIQ'),
 (['gi',
   '6693805',
   'gb',
   'AAF24991.1',
   ' (AF125506) astacin family metalloendopeptidase FARM-1 [Hydra vulgaris]'],
  'MSSSNHIHVLRAIDEYHKHTCLKFVKRTNQDAYLSFYPGGGCSSLVGYVRGRINDVSLAGGCLRLGTVMHEIGHSIGLYHEQSRPDRDDHVTIIWNNIQSNMRFNFDKFDRNKINSLGFPYDYESMMHYESNAFGGGQVTIRTKDPSKQKLIGNRQGFSEIDKQQINAMYNCNRGGSTLPPSVPPTVSPVAQCVEGQDLDNRCLGWATSGYCTATDPAHLETMKKKCCKSCKESAICNDKNTRCDEWAKKGECKANPNWMLGNCSKSCLVC'),
 (['gi',
   '6693816',
   'gb',
   'AAF24994.1',
   'AF129447_1 (AF129447) RpoB [Klebsiella ornithinolytica]'],
  'AAVKEFFGSSQLSQFMDQNNPLSEITHKRRISALGPGGLTRERAGFEVRDVHPTHYGRVCPIETPEGPNIGLINSLSVYAQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDDEGHFVEDLVTCRSKGESSLFSRDQVDYMDVSTQQVVSVGGSSERVL')]

Set & Dictionary Comprehensions

  • Set Comprehension - {expression for item in collection}
  • Dictionary Comprehension - {key-expression: value-expression for key, value in collection}
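
Both forms can be sketched on a made-up list of codons (the list and the names first_bases and codon_counts are assumptions for illustration):

```python
codons = ['AUG', 'UUU', 'AUG', 'GGG', 'UUU']   # made-up input

# Set comprehension: duplicates collapse automatically.
first_bases = {codon[0] for codon in codons}

# Dictionary comprehension: map each distinct codon to its count.
codon_counts = {codon: codons.count(codon) for codon in set(codons)}
```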
In [21]:
def make_indexed_sequence_dictionary(filename):
    return {info[3]: seq for info, seq in read_FASTA(filename)}

seqs = make_indexed_sequence_dictionary(filename)
seqs
Out[21]:
{'AAF24990.1': 'MPPCSEKTLKDIEEIFLKFRRKKKWEDLIRYLKYKQPKCVKTFNLTGTGHKYHAMWAYNPITDKREKKQISLDVMKIQELHRITNNNSKLYVEIRKIMTDDHRCPCEEIKNYMQQIAEYKNNRSNKVFNTPPTKIVPNALEKILKNFTINLMIDKKPKKKITKSAHTIKHPPVLNIDYEHTLEFAGQTTVKEICKHASLGDTIEIQNRSFDEMVNLYTTCVQCKQMYKIQ',
 'AAF24991.1': 'MSSSNHIHVLRAIDEYHKHTCLKFVKRTNQDAYLSFYPGGGCSSLVGYVRGRINDVSLAGGCLRLGTVMHEIGHSIGLYHEQSRPDRDDHVTIIWNNIQSNMRFNFDKFDRNKINSLGFPYDYESMMHYESNAFGGGQVTIRTKDPSKQKLIGNRQGFSEIDKQQINAMYNCNRGGSTLPPSVPPTVSPVAQCVEGQDLDNRCLGWATSGYCTATDPAHLETMKKKCCKSCKESAICNDKNTRCDEWAKKGECKANPNWMLGNCSKSCLVC',
 'AAF24994.1': 'AAVKEFFGSSQLSQFMDQNNPLSEITHKRRISALGPGGLTRERAGFEVRDVHPTHYGRVCPIETPEGPNIGLINSLSVYAQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDDEGHFVEDLVTCRSKGESSLFSRDQVDYMDVSTQQVVSVGGSSERVL'}

Generator expressions

  • A 'generator expression' is syntactically like a list or set comprehension, except that it is surrounded with parentheses and its value is a generator: (expression for item in collection)
In [30]:
### Generating amino acid translations of codons

def aa_generator(rnaseq):
    """Return a generator object that produces an amino acid by
    translating the next three characters of rnaseq each time next
    is called on it"""
    return (translate_RNA_codon(rnaseq[n:n+3])
            for n in range(0, len(rnaseq), 3))

seq = 'AUUCGAUCCGGACCCAUGAUCCCG'

print()
print(seq)
gen = aa_generator(seq)
assert 'Ile' == next(gen)
assert 'Arg' == next(gen)
assert 'Ser' == next(gen)
assert 'Gly' == next(gen)
assert 'Pro' == next(gen)
assert 'Met' == next(gen)
assert 'Ile' == next(gen)

gen = aa_generator(seq)
print(''.join(list(gen)))
AUUCGAUCCGGACCCAUGAUCCCG
IleArgSerGlyProMetIlePro
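
The difference between a list comprehension and a generator expression can be sketched with a simple sum of squares. The names here are illustrative, and the size comparison relies on sys.getsizeof, which measures only the container object itself.

```python
import sys

squares_list = [n * n for n in range(10000)]   # all values built up front
squares_gen = (n * n for n in range(10000))    # values computed on demand

# The generator object stays small no matter how long the series is.
list_is_bigger = sys.getsizeof(squares_list) > sys.getsizeof(squares_gen)

total = sum(squares_gen)       # consumes the generator
leftover = list(squares_gen)   # a generator can be consumed only once
```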

Conditional Comprehensions

  • One or more tests can be added to determine the elements for which the comprehension expression will be evaluated.
  • This is called filtering and is implemented with a conditional comprehension

[expression for element in collection if test]

Extract just sequence descriptions from a FASTA file and split them into fields at their vertical bars.

In [32]:
### Reading FASTA descriptions from a file
def get_FASTA_descriptions(filename):
    with open(filename) as file:
        return [line[1:].split('|') for line in file if line[0] == '>']

print(get_FASTA_descriptions('data/aa010.fasta'))
[['gi', '6693791', 'gb', 'AAF24984.1', 'AF082179_1 (AF082179) HepA-related protein HARP [Homo sapiens]\n'], ['gi', '6693793', 'gb', 'AAF24985.1', 'AF088884_1 (AF088884) HepA-related protein Harp [Mus musculus]\n'], ['gi', '6693798', 'gb', 'AAF24986.1', 'AF116242_1 (AF116242) K-Cl cotransporter KCC3 [Homo sapiens]\n'], ['gi', '6693800', 'gb', 'AAF24987.1', 'AF121349_1 (AF121349) HOAR-like protein [Neodiprion sertifer nucleopolyhedrovirus]\n'], ['gi', '6693801', 'gb', 'AAF24988.1', 'AF121349_2 (AF121349) ORF22 [Neodiprion sertifer nucleopolyhedrovirus]\n'], ['gi', '6693802', 'gb', 'AAF24989.1', 'AF121349_3 (AF121349) late expression factor 2 [Neodiprion sertifer nucleopolyhedrovirus]\n'], ['gi', '6693803', 'gb', 'AAF24990.1', 'AF121349_4 (AF121349) late expression factor 5 [Neodiprion sertifer nucleopolyhedrovirus]\n'], ['gi', '6693805', 'gb', 'AAF24991.1', ' (AF125506) astacin family metalloendopeptidase FARM-1 [Hydra vulgaris]\n'], ['gi', '6693816', 'gb', 'AAF24994.1', 'AF129447_1 (AF129447) RpoB [Klebsiella ornithinolytica]\n'], ['gi', '6693818', 'gb', 'AAF24995.1', 'AF129448_1 (AF129448) RpoB [Klebsiella terrigena]\n']]
In [5]:
### Reading FASTA descriptions using set comprehension (read the 4th field, index 3)

def get_FASTA_codes(filename):
    with open(filename) as file:
        return {line.split('|')[3] for line in file
                if line[0] == '>' and len(line.split('|')) > 3}

#print(get_FASTA_codes('data/BacillusSubtilisPlastmidP1414.fasta'))
print(get_FASTA_codes('data/aa010.fasta'))
{'AAF24991.1', 'AAF24990.1', 'AAF24995.1', 'AAF24987.1', 'AAF24988.1', 'AAF24989.1', 'AAF24986.1', 'AAF24994.1', 'AAF24984.1', 'AAF24985.1'}
In [35]:
### Constructing a selective dictionary

def make_gi_indexed_sequence_dictionary(filename):
    return {info[1]: seq for info, seq in read_FASTA(filename)
            if len(info) >= 2 and info[0] == 'gi'}

print(make_gi_indexed_sequence_dictionary('data/aa003.fasta'))
{'6693803': 'MPPCSEKTLKDIEEIFLKFRRKKKWEDLIRYLKYKQPKCVKTFNLTGTGHKYHAMWAYNPITDKREKKQISLDVMKIQELHRITNNNSKLYVEIRKIMTDDHRCPCEEIKNYMQQIAEYKNNRSNKVFNTPPTKIVPNALEKILKNFTINLMIDKKPKKKITKSAHTIKHPPVLNIDYEHTLEFAGQTTVKEICKHASLGDTIEIQNRSFDEMVNLYTTCVQCKQMYKIQ', '6693805': 'MSSSNHIHVLRAIDEYHKHTCLKFVKRTNQDAYLSFYPGGGCSSLVGYVRGRINDVSLAGGCLRLGTVMHEIGHSIGLYHEQSRPDRDDHVTIIWNNIQSNMRFNFDKFDRNKINSLGFPYDYESMMHYESNAFGGGQVTIRTKDPSKQKLIGNRQGFSEIDKQQINAMYNCNRGGSTLPPSVPPTVSPVAQCVEGQDLDNRCLGWATSGYCTATDPAHLETMKKKCCKSCKESAICNDKNTRCDEWAKKGECKANPNWMLGNCSKSCLVC', '6693816': 'AAVKEFFGSSQLSQFMDQNNPLSEITHKRRISALGPGGLTRERAGFEVRDVHPTHYGRVCPIETPEGPNIGLINSLSVYAQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDDEGHFVEDLVTCRSKGESSLFSRDQVDYMDVSTQQVVSVGGSSERVL'}
In [36]:
### Using a generator to find the first common element

def first_common(collection1, collection2):
    """Return the first element in collection1 that is in collection2"""
    return next((item for item in collection1 if item in collection2), None)

print(first_common(range(1,22, 5), range(0, 22, 4)))
16

Nested Comprehensions

  • Comprehensions can have more than one for clause. When they do, the rightmost for varies fastest: it runs through its entire collection for each element of the for clause to its left, and so on outward.
  • It is rare to see more than two for sections in a comprehension, but occasionally three can be useful.
In [6]:
### A nested comprehension for generating codons

def generate_triples(chars='TCAG'):
    """Return a list of all three-character combinations of unique
    characters in chars"""
    chars = set(chars)
    return [b1 + b2 + b3 for b1 in chars for b2 in chars for b3 in chars]

print(generate_triples())
['CCC', 'CCA', 'CCG', 'CCT', 'CAC', 'CAA', 'CAG', 'CAT', 'CGC', 'CGA', 'CGG', 'CGT', 'CTC', 'CTA', 'CTG', 'CTT', 'ACC', 'ACA', 'ACG', 'ACT', 'AAC', 'AAA', 'AAG', 'AAT', 'AGC', 'AGA', 'AGG', 'AGT', 'ATC', 'ATA', 'ATG', 'ATT', 'GCC', 'GCA', 'GCG', 'GCT', 'GAC', 'GAA', 'GAG', 'GAT', 'GGC', 'GGA', 'GGG', 'GGT', 'GTC', 'GTA', 'GTG', 'GTT', 'TCC', 'TCA', 'TCG', 'TCT', 'TAC', 'TAA', 'TAG', 'TAT', 'TGC', 'TGA', 'TGG', 'TGT', 'TTC', 'TTA', 'TTG', 'TTT']
In [7]:
print(set(generate_triples()))
{'GCC', 'CTT', 'AAC', 'TCA', 'GGT', 'CGT', 'GAC', 'TCT', 'AGG', 'CCT', 'TAT', 'TTC', 'ACT', 'CAG', 'TTT', 'GGA', 'GAA', 'CGC', 'GGG', 'CGG', 'TGG', 'CCG', 'TAC', 'GCG', 'TGA', 'AAT', 'GTT', 'GTG', 'CAA', 'GTC', 'AGT', 'CGA', 'CTC', 'TTA', 'CTG', 'ATA', 'GGC', 'TAA', 'TGC', 'GAG', 'CCA', 'AGA', 'GCA', 'TGT', 'CCC', 'AAA', 'CAC', 'AAG', 'ACC', 'ACG', 'TAG', 'TTG', 'CTA', 'AGC', 'GCT', 'ATG', 'GAT', 'GTA', 'TCG', 'CAT', 'TCC', 'ACA', 'ATT', 'ATC'}
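
To make the iteration order concrete, here is a two-level comprehension next to the equivalent explicit loops. The two-character string 'AB' is an arbitrary input chosen to keep the result short.

```python
chars = 'AB'

# Nested comprehension: the rightmost for varies fastest.
pairs = [c1 + c2 for c1 in chars for c2 in chars]

# The same result written as explicit nested loops.
pairs_loop = []
for c1 in chars:
    for c2 in chars:
        pairs_loop.append(c1 + c2)
```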

Functional Parameters

  • We will discuss the key parameter accepted by a few functions and methods.

key parameter

In [1]:
max(range(3, 7), key=abs)
Out[1]:
6
In [8]:
max(range(-7, 3))
Out[8]:
2
In [2]:
max(range(-7, 3), key=abs)
Out[2]:
-7

The value of the key argument is called on each element of the collection.


Consider a list seq_list containing RNA base sequences. We could select the sequence with the lowest GC content by calling min with key = gc_content.
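
A minimal sketch of that idea follows. The definition of gc_content here is an assumption (the fraction of G and C bases), since the notebook's own version is not shown in this section, and the sequences in seq_list are made up.

```python
def gc_content(base_seq):
    """Return the fraction of G and C bases in base_seq.

    An assumed definition for illustration; it expects a
    non-empty sequence.
    """
    seq = base_seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq)

seq_list = ['GGCCAU', 'AUAUGC', 'AUUUAA']   # made-up RNA sequences

lowest_gc = min(seq_list, key=gc_content)    # key is called on each element
highest_gc = max(seq_list, key=gc_content)
by_gc = sorted(seq_list, key=gc_content)     # sorted accepts key as well
```

The elements returned are the original sequences; the key function only decides how they are compared.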

The method list.sort was described earlier. It too can take an optional key parameter: the value of key is called on each element, and the elements of the list are sorted according to the values returned.

In [9]:
lst = ['T', 'G', 'A', 'G', 't', 'g', 'a', 'g']
In [10]:
lst
Out[10]:
['T', 'G', 'A', 'G', 't', 'g', 'a', 'g']
In [11]:
lst.sort()
In [12]:
lst 
Out[12]:
['A', 'G', 'G', 'T', 'a', 'g', 'g', 't']
In [7]:
lst.sort(key=str.lower)
In [8]:
lst
Out[8]:
['A', 'a', 'G', 'G', 'g', 'g', 'T', 't']
In [9]:
seqs = ['TACCTATACCGGCTA', 'cacctctaccgta', 'AACCTGTCCGGCTA']
seqs.sort()
seqs
Out[9]:
['AACCTGTCCGGCTA', 'TACCTATACCGGCTA', 'cacctctaccgta']
In [10]:
seqs = ['TACCTATACCGGCTA', 'cacctctaccgta', 'AACCTGTCCGGCTA']
seqs.sort(key = str.lower)
seqs
Out[10]:
['AACCTGTCCGGCTA', 'cacctctaccgta', 'TACCTATACCGGCTA']

Anonymous Functions

  • Functions and methods are objects that can be passed as arguments to other functions.
  • The def statement creates a function object and names it.
  • There are many situations in which a functional parameter avoids a great deal of repetitive code.

  • Python has a mechanism for defining lightweight functions without using def.
  • These functions don’t have names - they are anonymous.
  • Although they are functions, they are defined by an expression, not a statement. This kind of expression is called a lambda expression.
    Syntax: lambda args: expression-using-args
In [11]:
def fn (x,y):
    return x*x + y*y

fn = lambda x, y: x*x + y*y
In [12]:
### Definition of a function with a functional argument

def some(coll, pred=lambda x: x):
    """Return true if pred(item) is true for some item in coll"""
    return next((True for item in coll if pred(item)), False)

print()
print('some(range(5)) is', some(range(5)))
print("some((None, '', 0)) is", some((None, '', 0)))
print('some(range(5), lambda x: x > 5) is', some(range(5), lambda x: x > 5))
print('some(range(5), lambda x: x > 3) is', some(range(5), lambda x: x > 3))
some(range(5)) is True
some((None, '', 0)) is False
some(range(5), lambda x: x > 5) is False
some(range(5), lambda x: x > 3) is True

Suppose we have a list of strings in mixed case and want to order them by length first, and then alphabetically, ignoring case.

In [13]:
l = [(3, 'abc'), (5, 'ghijk'), (5, 'abcde'), (2, 'bd')]
In [14]:
l.sort()
l
Out[14]:
[(2, 'bd'), (3, 'abc'), (5, 'abcde'), (5, 'ghijk')]
In [15]:
l = ['abc', 'ghijk', 'abcde', 'bd']
In [16]:
l.sort(key=lambda seq:(len(seq), seq.lower()))
l    
Out[16]:
['bd', 'abc', 'abcde', 'ghijk']